Abstract. Dealing with unichain MDPs, we consider stationary distributions of policies that coincide in all but $n$ states. In these states each policy chooses one of two possible actions. We show that the stationary distributions of $n + 1$ such policies uniquely determine the stationary distributions of all other such policies. An explicit formula for calculation is given.

1. Introduction

Definition 1.1. A Markov decision process (MDP) $\mathcal{M}$ on a (finite) set of states $S$ with a (finite) set of actions $A$ available in each state in $S$ consists of

(i) an initial distribution $\mu_0$ that specifies the probability of starting in some state in $S$,

(ii) the transition probabilities $p_a(i, j)$ that specify the probability of reaching state $j$ when choosing action $a$ in state $i$, and

A (stationary) policy on $\mathcal{M}$ is a mapping $\pi : S \to A$.




Note that each policy $\pi$ induces a Markov chain on $\mathcal{M}$. We are interested in MDPs where, in each of the induced Markov chains, any state is reachable from any other state.



Definition 1.2. An MDP $\mathcal{M}$ is called unichain if for each policy $\pi$ the Markov chain induced by $\pi$ is ergodic, i.e. if the matrix $P = (p_{\pi(i)}(i, j))_{i,j \in S}$ is irreducible.
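As a small illustration of Definition 1.2, the following Python sketch checks the unichain property by brute force: it enumerates all deterministic policies, builds the induced transition matrix, and tests irreducibility with a reachability search. The nested-list layout `p[a][i][j]` for $p_a(i, j)$, the function names, and the three-state example are assumptions made for this sketch. The enumeration is exponential in the number of states, so it is only meant for toy examples.

```python
from itertools import product

def induced_matrix(p, policy):
    """Row i of the induced chain is the row of action policy[i] in state i."""
    return [p[policy[i]][i] for i in range(len(policy))]

def is_irreducible(P):
    """True if every state reaches every other state along positive-probability steps."""
    N = len(P)
    for start in range(N):
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in range(N):
                if P[i][j] > 0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if len(seen) < N:
            return False
    return True

def is_unichain(p, num_actions):
    """Check irreducibility of the induced chain for every deterministic policy."""
    N = len(p[0])
    return all(is_irreducible(induced_matrix(p, pi))
               for pi in product(range(num_actions), repeat=N))

# Hypothetical example with 3 states and 2 actions: p[a][i][j] = p_a(i, j).
p = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],   # action 1
]
print(is_unichain(p, 2))
```

In this example every policy keeps the cycle 0 → 1 → 2 → 0 available, so each induced chain is irreducible.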


It is a well-known fact (cf. e.g. [1], p. 130ff) that for an ergodic Markov chain with transition matrix $P$ there exists a unique invariant and strictly positive distribution $\mu$, such that independent of the initial distribution $\mu_0$ one has $\mu_n = \mu_0 \bar{P}_n \to \mu$, where

2. Main Theorem and Proof 

Given $n$ policies $\pi_1, \pi_2, \ldots, \pi_n$ we say that another policy $\pi$ is a combination of $\pi_1, \pi_2, \ldots, \pi_n$, if for each state $s$ one has $\pi(s) = \pi_i(s)$ for some $i$.
Actually, for aperiodic Markov chains one has even $\mu_0 P^n \to \mu$, while the convergence behavior of periodic Markov chains can be described more precisely. However, for our purposes the stated fact is sufficient.
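The difference between the two modes of convergence can be seen on a periodic two-state chain. The sketch below assumes that the averaged quantity in the stated fact is the Cesàro mean $\frac{1}{n} \sum_{j=1}^{n} \mu_0 P^j$. This reading, the function names, and the example chain are assumptions of the sketch.

```python
def step(mu, P):
    """One step of the chain: the row vector mu times the matrix P."""
    return [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

def power_and_average(mu0, P, n):
    """Return (mu0 P^n, (1/n) * sum_{j=1..n} mu0 P^j)."""
    mu, total = list(mu0), [0.0] * len(P)
    for _ in range(n):
        mu = step(mu, P)
        total = [t + m for t, m in zip(total, mu)]
    return mu, [t / n for t in total]

# Periodic two-state chain with stationary distribution (1/2, 1/2).
P = [[0.0, 1.0], [1.0, 0.0]]
mu0 = [1.0, 0.0]
for n in (10, 11):
    print(n, power_and_average(mu0, P, n))
```

Here $\mu_0 P^n$ keeps alternating between $(1, 0)$ and $(0, 1)$, while the averages approach $(1/2, 1/2)$.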
Theorem 2.1. Let $\mathcal{M}$ be a unichain MDP and $\pi_1, \pi_2, \ldots, \pi_{n+1}$ pairwise distinct policies on $\mathcal{M}$ that coincide on all but $n$ states $s_1, s_2, \ldots, s_n$. In these states each policy applies one of two possible actions, i.e. we assume that for each $i$ and each $j$ either $\pi_i(s_j) = 0$ or $\pi_i(s_j) = 1$. Then the stationary distributions of all combinations of $\pi_1, \pi_2, \ldots, \pi_{n+1}$ are uniquely determined by the stationary distributions $\mu_i$ of the policies $\pi_i$.
More precisely, if we represent each combined policy $\pi$ by the word $\pi(s_1)\pi(s_2)\ldots\pi(s_n)$, we may assume without loss of generality (by swapping the names of the actions correspondingly) that the policy $\pi$ we want to determine is $11\ldots1$. Let $S_n$ be the set of permutations of the elements $\{1, \ldots, n\}$. Then setting

$$\Gamma_k := \{\gamma \in S_{n+1} \mid \gamma(k) = n + 1 \text{ and } \pi_j(s_{\gamma(j)}) = 0 \text{ for all } j \neq k\}$$

one has for the stationary distribution $\mu$ of $\pi$

$$\mu(s) = \frac{\sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma_k} \operatorname{sgn}(\gamma)\, \mu_k(s) \prod_{j=1,\, j \neq k}^{n+1} \mu_j(s_{\gamma(j)})}{\sum_{s' \in S} \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma_k} \operatorname{sgn}(\gamma)\, \mu_k(s') \prod_{j=1,\, j \neq k}^{n+1} \mu_j(s_{\gamma(j)})}.$$
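The formula of Theorem 2.1 can be checked numerically. The following Python sketch is one possible way to do so under stated assumptions: the MDP is a hypothetical one with 5 states and strictly positive, randomly drawn rational transition probabilities (so it is unichain), the states $s_1, s_2, s_3$ are taken to be the states 0, 1, 2, and the four given policies are 000, 010, 101, 110. All function names are illustrative. Exact rational arithmetic is used so that the predicted and the directly computed distribution can be compared for equality.

```python
from fractions import Fraction
from itertools import permutations
import random

def stationary(P):
    """Exact stationary distribution of an irreducible stochastic matrix P."""
    N = len(P)
    # Balance equations for columns 0..N-2, plus the normalization sum(mu) = 1.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(N)] + [Fraction(0)]
         for j in range(N - 1)]
    A.append([Fraction(1)] * (N + 1))
    for c in range(N):                                  # Gauss-Jordan elimination
        piv = next(r for r in range(c, N) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [x / pivot for x in A[c]]
        for r in range(N):
            if r != c and A[r][c] != 0:
                fac = A[r][c]
                A[r] = [x - fac * y for x, y in zip(A[r], A[c])]
    return [A[r][N] for r in range(N)]

def sign(perm):
    """Sign of a permutation in one-line notation, via its inversion count."""
    inv = sum(perm[a] > perm[b]
              for a in range(len(perm)) for b in range(a + 1, len(perm)))
    return -1 if inv % 2 else 1

def combine_all_ones(words, mus):
    """Stationary distribution of the policy 11...1 by the formula of Theorem 2.1.

    words[j][i] is the action of policy j in state i (i < n, 0-based);
    mus[j] is the stationary distribution of policy j.
    """
    n, N = len(words) - 1, len(mus[0])
    nu = [Fraction(0)] * N
    for g in permutations(range(n + 1)):
        k = g.index(n)                  # gamma(k) = n + 1 in the 1-based notation
        if any(words[j][g[j]] != 0 for j in range(n + 1) if j != k):
            continue                    # gamma does not belong to Gamma_k
        coeff = Fraction(sign(g))
        for j in range(n + 1):
            if j != k:
                coeff *= mus[j][g[j]]
        nu = [nu[s] + coeff * mus[k][s] for s in range(N)]
    total = sum(nu)
    return [v / total for v in nu]

# Hypothetical MDP: 5 states, 2 actions, strictly positive transition probabilities.
rng = random.Random(0)
N = 5

def random_row():
    w = [rng.randint(1, 9) for _ in range(N)]
    return [Fraction(x, sum(w)) for x in w]

p = [[random_row() for _ in range(N)] for _ in range(2)]    # p[a][i][j] = p_a(i, j)

def policy_matrix(word):
    """Actions from `word` in states 0..n-1, action 0 in all other states."""
    return [p[word[i] if i < len(word) else 0][i] for i in range(N)]

words = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
mus = [stationary(policy_matrix(w)) for w in words]
predicted = combine_all_ones(words, mus)
direct = stationary(policy_matrix((1, 1, 1)))
print(predicted == direct)
```

The sketch divides by the normalizing sum without guarding against zero, so it presupposes that this sum does not vanish for the chosen policies.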



For clarification of Theorem 2.1 we proceed with an example.

Example 2.2. Let $\mathcal{M}$ be a unichain MDP and $\pi_{000}, \pi_{010}, \pi_{101}, \pi_{110}$ policies on $\mathcal{M}$ whose actions differ only in three states $s_1$, $s_2$ and $s_3$. The subindices of a policy correspond to the word $\pi(s_1)\pi(s_2)\pi(s_3)$, so that e.g. $\pi_{010}(s_1) = \pi_{010}(s_3) = 0$ and $\pi_{010}(s_2) = 1$. Now let $\mu_{000}$, $\mu_{010}$, $\mu_{101}$, and $\mu_{110}$ be the stationary distributions of the respective policies. Theorem 2.1 tells us that we may calculate the distributions of all other policies that play in states $s_1, s_2, s_3$ action 0 or 1 and coincide with the above mentioned policies in all other states. In order to calculate e.g. the stationary distribution $\mu_{111}$ of policy $\pi_{111}$ in an arbitrary state $s$, we have to calculate the sets $\Gamma_{000}$, $\Gamma_{010}$, $\Gamma_{101}$, and $\Gamma_{110}$. This can be done by interpreting the subindices of our policies as rows of a matrix. In order to obtain $\Gamma_k$ one cancels row $k$ and looks for all possibilities in the remaining matrix to choose three 0s that neither share a row nor a column:

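$$
\begin{array}{ccc}
\cancel{0} & \cancel{0} & \cancel{0} \\
\boxed{0} & 1 & 0 \\
1 & \boxed{0} & 1 \\
1 & 1 & \boxed{0}
\end{array}
\qquad
\begin{array}{ccc}
\boxed{0} & 0 & 0 \\
\cancel{0} & \cancel{1} & \cancel{0} \\
1 & \boxed{0} & 1 \\
1 & 1 & \boxed{0}
\end{array}
\qquad
\begin{array}{ccc}
0 & \boxed{0} & 0 \\
\boxed{0} & 1 & 0 \\
\cancel{1} & \cancel{0} & \cancel{1} \\
1 & 1 & \boxed{0}
\end{array}
\qquad
\begin{array}{ccc}
\boxed{0} & 0 & 0 \\
0 & 1 & \boxed{0} \\
1 & \boxed{0} & 1 \\
\cancel{1} & \cancel{1} & \cancel{0}
\end{array}
\qquad
\begin{array}{ccc}
0 & 0 & \boxed{0} \\
\boxed{0} & 1 & 0 \\
1 & \boxed{0} & 1 \\
\cancel{1} & \cancel{1} & \cancel{0}
\end{array}
$$

(Cancelled rows are struck out, the chosen 0s are boxed.)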

Each of the matrices now corresponds to a permutation in $\Gamma_k$, where $k$ corresponds to the cancelled row. Thus $\Gamma_{000}$, $\Gamma_{010}$ and $\Gamma_{101}$ contain only a single permutation, while $\Gamma_{110}$ contains two. The respective permutation can be read off each matrix as follows: note for each row one after another the position of the chosen 0, and choose $n + 1$ for the cancelled row. Thus the permutation for the third matrix is $(2, 1, 4, 3)$. Now for each of the matrices one has a term that consists of four factors (one for each row). The factor for a row $j$ is $\mu_j(s')$, where $s' = s$ if row $j$ was cancelled (i.e. $j = k$), or equals the state that corresponds to the column of row $j$ in which the 0 was chosen. Thus for the third matrix above one gets $\mu_{000}(s_2)\,\mu_{010}(s_1)\,\mu_{101}(s)\,\mu_{110}(s_3)$. Finally, one has to consider the sign for each of the terms, which is the sign of the corresponding permutation. Putting all together, normalizing the output vector and abbreviating $a_i := \mu_{000}(s_i)$, $b_i := \mu_{010}(s_i)$, $c_i := \mu_{101}(s_i)$, and $d_i := \mu_{110}(s_i)$ one obtains

$$\mu_{111}(s) = \frac{\mu_{000}(s)\, b_1 c_2 d_3 - a_1\, \mu_{010}(s)\, c_2 d_3 - a_2 b_1\, \mu_{101}(s)\, d_3 + a_1 b_3 c_2\, \mu_{110}(s) - a_3 b_1 c_2\, \mu_{110}(s)}{b_1 c_2 d_3 - a_1 c_2 d_3 - a_2 b_1 d_3 + a_1 b_3 c_2 - a_3 b_1 c_2}.$$
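The sets $\Gamma_k$ of Example 2.2 can also be enumerated mechanically. The short Python sketch below does this by running through all permutations of $\{1, \ldots, 4\}$ and keeping those that select a 0 in every row that is not cancelled. The function names and the representation of policies as tuples are assumptions of the sketch.

```python
from itertools import permutations

def sign(perm):
    """Sign of a permutation in one-line notation, via its inversion count."""
    inv = sum(perm[a] > perm[b]
              for a in range(len(perm)) for b in range(a + 1, len(perm)))
    return -1 if inv % 2 else 1

def gamma_sets(words):
    """Map k (1-based) to the permutations in Gamma_k, in 1-based one-line notation."""
    n = len(words) - 1
    sets = {k: [] for k in range(1, n + 2)}
    for g in permutations(range(n + 1)):
        k = g.index(n)                   # row k is cancelled: gamma(k) = n + 1
        if all(words[j][g[j]] == 0 for j in range(n + 1) if j != k):
            sets[k + 1].append(tuple(x + 1 for x in g))
    return sets

words = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 0)]   # rows 000, 010, 101, 110
for k, perms in gamma_sets(words).items():
    print(k, [(g, sign(g)) for g in perms])
# 1 [((4, 1, 2, 3), -1)]
# 2 [((1, 4, 2, 3), 1)]
# 3 [((2, 1, 4, 3), 1)]
# 4 [((1, 3, 2, 4), -1), ((3, 1, 2, 4), 1)]
```

The signs printed here are the opposite of those in the displayed formula for $\mu_{111}(s)$. Since numerator and denominator change sign together, the quotient is the same.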
Theorem 2.1 can be obtained from the following more general result where the stationary distribution of a randomized policy is considered.

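Theorem 2.3. Under the assumptions of Theorem 2.1 the stationary distribution $\mu$ of the policy $\pi$ that plays in state $s_i$ ($i = 1, \ldots, n$) action 0 with probability $\lambda_i \in [0, 1]$ and action 1 with probability $(1 - \lambda_i)$ is given by

$$\mu(s) = \frac{\sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, \mu_k(s) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j)}{\sum_{s' \in S} \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, \mu_k(s') \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j)},$$

where $\Gamma'_k := \{\gamma \in S_{n+1} \mid \gamma(k) = n + 1\}$ and

$$f(i, j) := \begin{cases} \lambda_i\, \mu_j(s_i), & \text{if } \pi_j(s_i) = 1, \\ (\lambda_i - 1)\, \mu_j(s_i), & \text{if } \pi_j(s_i) = 0. \end{cases}$$

Theorem 2.1 follows from Theorem 2.3 by simply setting $\lambda_i = 0$ for $i = 1, \ldots, n$: then $f(\gamma(j), j)$ vanishes unless $\pi_j(s_{\gamma(j)}) = 0$, so only the permutations in $\Gamma_k$ contribute, and the common factor $(-1)^n$ cancels in the quotient.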
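A numerical check of Theorem 2.3 can be set up along the following lines. The sketch assumes a hypothetical MDP with 4 states and strictly positive, randomly drawn rational transition probabilities, takes $s_1, s_2$ to be the states 0 and 1, uses the three policies 00, 01, 10, and picks the mixing probabilities $\lambda = (1/3, 3/4)$ arbitrarily. The second case of $f$ in the code is $\lambda_i \mu_j(s_i)$ for $\pi_j(s_i) = 1$, as it appears in the proof. All names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
import random

def stationary(P):
    """Exact stationary distribution of an irreducible stochastic matrix P."""
    N = len(P)
    A = [[P[i][j] - (1 if i == j else 0) for i in range(N)] + [Fraction(0)]
         for j in range(N - 1)]
    A.append([Fraction(1)] * (N + 1))       # normalization: sum(mu) = 1
    for c in range(N):                      # Gauss-Jordan elimination
        piv = next(r for r in range(c, N) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [x / pivot for x in A[c]]
        for r in range(N):
            if r != c and A[r][c] != 0:
                fac = A[r][c]
                A[r] = [x - fac * y for x, y in zip(A[r], A[c])]
    return [A[r][N] for r in range(N)]

def sign(perm):
    inv = sum(perm[a] > perm[b]
              for a in range(len(perm)) for b in range(a + 1, len(perm)))
    return -1 if inv % 2 else 1

def mixed_stationary(words, mus, lam):
    """Formula of Theorem 2.3; lam[i] is the probability of action 0 in state i."""
    n, N = len(lam), len(mus[0])

    def f(i, j):
        factor = lam[i] if words[j][i] == 1 else lam[i] - 1
        return factor * mus[j][i]

    nu = [Fraction(0)] * N
    for g in permutations(range(n + 1)):
        k = g.index(n)                      # gamma(k) = n + 1 in 1-based notation
        coeff = Fraction(sign(g))
        for j in range(n + 1):
            if j != k:
                coeff *= f(g[j], j)
        nu = [nu[s] + coeff * mus[k][s] for s in range(N)]
    total = sum(nu)
    return [v / total for v in nu]

# Hypothetical MDP: 4 states, 2 actions, strictly positive transition probabilities.
rng = random.Random(0)
N = 4

def random_row():
    w = [rng.randint(1, 9) for _ in range(N)]
    return [Fraction(x, sum(w)) for x in w]

p = [[random_row() for _ in range(N)] for _ in range(2)]    # p[a][i][j] = p_a(i, j)
words = [(0, 0), (0, 1), (1, 0)]
lam = [Fraction(1, 3), Fraction(3, 4)]

def policy_matrix(word):
    return [p[word[i] if i < len(word) else 0][i] for i in range(N)]

# Transition matrix of the randomized policy: mix the two rows in states 0 and 1.
P_mixed = [[lam[i] * x + (1 - lam[i]) * y for x, y in zip(p[0][i], p[1][i])]
           if i < len(lam) else p[0][i] for i in range(N)]

mus = [stationary(policy_matrix(w)) for w in words]
predicted = mixed_stationary(words, mus, lam)
direct = stationary(P_mixed)
print(predicted == direct)
```

As with any use of the formula, the sketch presupposes that the normalizing sum is nonzero for the chosen policies and mixing probabilities.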

Proof of Theorem 2.3. Let $S = \{1, 2, \ldots, N\}$ and assume that $s_i = i$ for $i = 1, 2, \ldots, n$. We denote the probabilities associated with action 0 with $p_{ij} := p_0(i, j)$ and those of action 1 with $q_{ij} := p_1(i, j)$. Furthermore, the probabilities in the states $i = n + 1, \ldots, N$, where the policies $\pi_1, \ldots, \pi_{n+1}$ coincide, are written as $p_{ij} := p_{\pi_k(i)}(i, j)$ as well. Now setting

$$\nu_s := \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, \mu_k(s) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j)$$

and $\nu := (\nu_s)_{s \in S}$ we are going to show that $\nu P_\pi = \nu$, where $P_\pi$ is the probability matrix of the randomized policy $\pi$. Since the stationary distribution is unique, normalization of the vector $\nu$ proves the theorem. Now

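$$
\begin{aligned}
(\nu P_\pi)_s &= \sum_{i=1}^{n} \nu_i \big(\lambda_i p_{is} + (1 - \lambda_i) q_{is}\big) + \sum_{i=n+1}^{N} \nu_i\, p_{is} \\
&= \sum_{i=1}^{n} \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, \mu_k(i) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j)\, \big(\lambda_i p_{is} + (1 - \lambda_i) q_{is}\big) \\
&\quad + \sum_{i=n+1}^{N} \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, \mu_k(i) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j)\, p_{is}.
\end{aligned}
$$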

Since

$$\sum_{i=n+1}^{N} \mu_k(i)\, p_{is} = \mu_k(s) - \sum_{i:\, \pi_k(i) = 0} \mu_k(i)\, p_{is} - \sum_{i:\, \pi_k(i) = 1} \mu_k(i)\, q_{is},$$
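this gives

$$
\begin{aligned}
(\nu P_\pi)_s &= \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j) \Big( \sum_{i=1}^{n} \mu_k(i) \big(\lambda_i p_{is} + (1 - \lambda_i) q_{is}\big) \\
&\qquad + \mu_k(s) - \sum_{i:\, \pi_k(i) = 0} \mu_k(i)\, p_{is} - \sum_{i:\, \pi_k(i) = 1} \mu_k(i)\, q_{is} \Big) \\
&= \nu_s + \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j) \Big( \sum_{i:\, \pi_k(i) = 0} \mu_k(i) (\lambda_i - 1)(p_{is} - q_{is}) + \sum_{i:\, \pi_k(i) = 1} \mu_k(i)\, \lambda_i\, (p_{is} - q_{is}) \Big) \\
&= \nu_s + \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j) \sum_{i=1}^{n} (p_{is} - q_{is})\, f(i, k) \\
&= \nu_s + \sum_{i=1}^{n} (p_{is} - q_{is}) \sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, f(i, k) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j).
\end{aligned}
$$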

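Now it is easy to see that $\sum_{k=1}^{n+1} \sum_{\gamma \in \Gamma'_k} \operatorname{sgn}(\gamma)\, f(i, k) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j) = 0$: fix $k$ and some permutation $\gamma \in \Gamma'_k$, and let $l := \gamma^{-1}(i)$. Then there is exactly one permutation $\gamma' \in \Gamma'_l$, such that $\gamma'(j) = \gamma(j)$ for $j \neq k, l$ and $\gamma'(k) = i$. The pairs $(k, \gamma)$ and $(l, \gamma')$ correspond to the same summands

$$f(i, k) \prod_{j=1,\, j \neq k}^{n+1} f(\gamma(j), j) = f(i, l) \prod_{j=1,\, j \neq l}^{n+1} f(\gamma'(j), j),$$

yet, since $\operatorname{sgn}(\gamma) = -\operatorname{sgn}(\gamma')$, they have different signs and cancel each other out. $\square$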
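The cancellation used in the last step holds for arbitrary values of $f$, which can be confirmed by a direct computation. The Python sketch below evaluates the double sum for a randomly chosen integer array `f` with $n$ rows and $n + 1$ columns (a hypothetical test input, with 0-based indices). Another way to see the identity is that the sum is the determinant of the $(n+1) \times (n+1)$ matrix obtained from `f` by appending a copy of row $i$, and this matrix has two equal rows.

```python
from itertools import permutations
import random

def sign(perm):
    """Sign of a permutation in one-line notation, via its inversion count."""
    inv = sum(perm[a] > perm[b]
              for a in range(len(perm)) for b in range(a + 1, len(perm)))
    return -1 if inv % 2 else 1

def cancellation_sum(f, i):
    """sum_k sum_{gamma(k) = n+1} sgn(gamma) f(i, k) prod_{j != k} f(gamma(j), j)."""
    n = len(f)                            # f has n rows and n + 1 columns
    total = 0
    for g in permutations(range(n + 1)):
        k = g.index(n)                    # gamma(k) = n + 1 in 1-based notation
        term = sign(g) * f[i][k]
        for j in range(n + 1):
            if j != k:
                term *= f[g[j]][j]
        total += term
    return total

rng = random.Random(1)
n = 3
f = [[rng.randint(-9, 9) for _ in range(n + 1)] for _ in range(n)]
print([cancellation_sum(f, i) for i in range(n)])   # [0, 0, 0]
```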